1. INTRODUCTION


Although every regular set is described by some regular expression,
there may not be any reasonably short regular expression for a given
regular set. Therefore, people tend to use more powerful means of
describing sets. For example, the complementation and intersection
operators can be used to form concise representations of some sets (see
Stockmeyer (1974)).

Auxiliary variables are commonly used informally in regular expressions
(RE), and especially finite automata (FA), to help shorten the
representation of a set. For example, consider a lexical scanner of a
compiler, which is commonly implemented as a finite state machine.
Suppose a certain language permits decimal integers of the form Dn
(the letter D followed by decimal digits), and binary integers of the
form Bn (the letter B followed by binary digits). An ordinary finite
state machine recognizing either decimal or binary integers would read
either D or B, and then fork into two separate machines, one accepting
decimal digits, the other accepting binary digits. But by adding an
auxiliary Boolean variable, call it dec, the binary and decimal integer
readers can share the same states. The machine described below
recognizes decimal and binary integers.

    State 1:  on input B, set dec := false, goto 2;
              on input D, set dec := true, goto 2.

    State 2:  on input 0 or 1, goto 2;
              on input 2,...,9, if dec then goto 2,
                                else stop and reject;
              on end of string, stop and accept;
              on any other symbol, stop and reject.
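
The two-state machine can be transcribed almost line for line. The following is a minimal Python sketch, not part of the original presentation; the function name scan_integer is hypothetical. As in the machine, a bare D or B with no digits is accepted, because end of string in State 2 accepts.

```python
def scan_integer(text):
    """Return True if text is D followed by decimal digits or B followed by binary digits."""
    if not text:
        return False
    # State 1: the first symbol selects the radix by setting dec.
    if text[0] == "B":
        dec = False
    elif text[0] == "D":
        dec = True
    else:
        return False
    # State 2: a single loop is shared by both radices.
    for ch in text[1:]:
        if ch in "01":
            continue            # legal in both radices
        if ch in "23456789" and dec:
            continue            # legal only when dec is true
        return False            # any other symbol: stop and reject
    return True                 # end of string: stop and accept
```

The variable dec is the only thing that distinguishes the two readers; without it the loop body would have to be duplicated, once per radix.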
The example above is very simple, and the reader may consider the ordinary
finite state machine, without auxiliary variables, more appealing.
However, the reader can no doubt supply examples from his own experience in
which a few well chosen variables greatly reduce the length of a program,
even one which is essentially finite state.

In this paper we formally define regular expressions and finite
automata with auxiliary Boolean variables, and prove some results concerning
conciseness of representation of sets in the various models (ordinary RE,
ordinary FA, RE with auxiliary Boolean variables and FA with auxiliary
Boolean variables).

We begin with an informal description of RE and FA with auxiliary
Boolean variables (BRE and BFA, respectively). There are two basic types
of operations on a variable: setting it to a particular value, and testing
whether it contains a particular value. For Boolean variables, there are
just four operations: x↑ (set x to true), x↓ (set x to false), x?
(is x true?), and x̄? (is x false?). Each may appear in a BFA or BRE
wherever an alphabetic symbol may normally appear. (Although x↑, x↓, x?
and x̄? are written as strings of two or more characters, we think of them
as indivisible symbols.) In a BFA, a transition labeled x↑ (x↓) acts as
a null move, with the side effect of setting the value of x to true (false).
A transition labeled x? (x̄?) may be taken, as a null move, provided x is
currently true (false). When x is false (true), x? (x̄?) transitions may
not be taken.

The following BFA recognizes the decimal and binary integers described
above.

[Figure: BFA for decimal and binary integers.]

State 1 is the start state, and State 4 is the accepting state.

The special forms x↑, x↓, x? and x̄? can be added to RE as well as
FA. One way to understand BRE is to imagine a BFA "executing" the BRE.
In that view, when x↑ (x↓) is encountered, no symbol is generated, but
x is set true (false). When x is true (false), x? (x̄?) is treated
as Λ, the null string. When x is false (true), x? (x̄?) may not be
generated; the generation must be aborted.

Alternatively, x↑, x↓, x? and x̄? can be generated just as if they
were alphabetic symbols. The strings generated in such a manner by a
BRE must be passed through a filter, which filters out any string which
contains x↑ t x̄? or x↓ t x? as a substring, where t does not contain
x↑ or x↓. As all variables are arbitrarily assumed to be false initially,
strings which begin with a prefix of the form t x? must also be filtered
out. After filtration, all occurrences of x↑, x↓, x? and x̄? are erased.
The formal definition of BRE given here uses this filtration approach.
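
The filter can be applied to a single unfiltered string by one left-to-right pass that remembers the value most recently assigned to each variable. The following Python sketch is an illustration under assumed encodings, none of which come from the paper: alphabet symbols are one-character strings, a set operation is the tuple ("set", name, value), and a test is the tuple ("test", name, expected). The function name filter_and_erase is hypothetical.

```python
def filter_and_erase(tokens):
    """Return the erased string if the token sequence is consistent, else None."""
    value = {}   # a variable absent from this dict is still false (the initial value)
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.append(tok)              # ordinary alphabet symbol: keep it
            continue
        kind, name, flag = tok
        if kind == "set":
            value[name] = flag           # set the variable to true or false
        elif kind == "test":
            if value.get(name, False) != flag:
                return None              # inconsistent test: the string is filtered out
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return "".join(out)                  # variable operations are erased
```

A test fails exactly when the most recent assignment to its variable (or the initial value false, if there is none) disagrees with it, which is the condition the substring and prefix patterns describe.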

Boolean variables can be used to simulate some useful concepts.

1. Integers in the range [0, 2^n - 1] can be stored in n Boolean
variables. It is a straightforward exercise to write "programs" which
increment variables mod 2^n, or add them mod 2^n, or test for zero. A
"program" in this context is a BRE or BFA, depending on which model is being
used, whose language is {Λ}, and which has a desired side effect on
some Boolean variables. For example, suppose x is stored in
x_1, ..., x_n. "x ← x + 1 mod 2^n" is accomplished by the program

    y↑ · (ȳ? ∪ y?(x_1? · x_1↓ ∪ x̄_1? · x_1↑ · y↓))
       · ...
       · (ȳ? ∪ y?(x_n? · x_n↓ ∪ x̄_n? · x_n↑ · y↓)).

The extra variable y holds the carry bit. Each line does one stage of
a ripple through increment.
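
The effect of the increment program on the variables can be traced with ordinary Booleans. The sketch below is illustrative only; it assumes that the first variable holds the least significant bit, which the text does not state, and the names increment_mod and to_int are hypothetical.

```python
def increment_mod(bits):
    """Ripple increment of a bit list; bits[0] is assumed to be the least significant bit."""
    bits = list(bits)
    y = True                  # set the carry y to true
    for i in range(len(bits)):
        if not y:             # carry is false: this stage does nothing
            continue
        if bits[i]:           # bit is true: clear it, carry stays true
            bits[i] = False
        else:                 # bit is false: set it, carry becomes false
            bits[i] = True
            y = False
    return bits


def to_int(bits):
    """Decode a least-significant-first bit list as a non-negative integer."""
    return sum(2 ** i for i, b in enumerate(bits) if b)
```

Each loop iteration corresponds to one line of the program, and exactly one of the alternatives in that line passes the filter for any given values of the carry and the bit.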

2. It is possible to write a BRE for E^n = E · ... · E (n times),
which has length only ℓ(E) + O(log n). Using ⌊log n⌋ + 1 variables to
hold an integer x in the range [0, n], we write E^n as

    (x ← 0) · ((x ≠ n?) · E · (x ← x + 1))* · (x = n?).

3. Subroutine calls to bounded depth with bounded parameters can be
implemented in BFA. Simply store a "return address" in certain variables
before transferring to a certain section of the BFA. When the return state
is encountered, test the value of the return address. In Section 3 we show
that BRE can efficiently (i.e. with a factor of log n size increase)
simulate BFA. Using that simulation, subroutines can be implemented in BRE.

Programs in Propositional Dynamic Logic (PDL) are very close to regular
expressions. · is the sequencing operator, ∪ the nondeterministic choice
operator, and * the looping operator. Boolean variables can be added to
such programs very much the way they are added to regular expressions. The
programs x↑ and x↓ are thought of as assignment programs. The extension
B-PDL (Boolean PDL) of PDL is studied in Abrahamson (1980). There Boolean
variables are used to express properties in B-PDL which are also expressible
in PDL, but not obviously so.


2. FORMAL DEFINITIONS OF BOOLEAN REGULAR EXPRESSIONS AND FINITE AUTOMATA


Let B be a finite set of Boolean variables, and let Σ be a finite
alphabet. Let Γ = {x↑, x↓, x?, x̄? : x ∈ B}. A Boolean regular expression
(BRE) over alphabet Σ and variables B is any well formed expression
formed from members of Σ ∪ Γ, the dyadic operators ∪ and ·, and the
monadic operator *. The unfiltered language U(E) of expression E is
defined inductively by the following rules.

    U(σ) = {σ} for σ ∈ Σ ∪ Γ,
    U(E ∪ F) = U(E) ∪ U(F),
    U(E · F) = U(E) · U(F) (concatenation of sets),
    U(E*) = U(E)* (Kleene closure).

Let Δ = Σ ∪ Γ and Δ_x = Δ - {x↑, x↓}. R_x is the set of strings which are
inconsistent with respect to variable x, and is defined by the regular
expression

    R_x = (Δ* · x↑ · Δ_x* · x̄? · Δ*) ∪ (Δ* · x↓ · Δ_x* · x? · Δ*)
          ∪ (Δ_x* · x? · Δ*).

The language L(E) of BRE E is defined by

    L(E) = H(U(E) - ∪_{x ∈ B} R_x),

where H just erases members of Γ from strings.

Example. The expression x↓ · y↑ · (y? · E · (x̄? · x↑ ∪ x? · y↓))* · ȳ?,
where E is any expression, represents the same language as E · E. Note
that E appears only once. If E is a large expression, the expression given
here is only a little more than half as long as E · E. The reader may find
it helpful to generate E · E from this expression.
A Boolean finite automaton (BFA) is a 6-tuple (V, Σ, B, δ, Q, v_0),
where

    V is a set of vertices, or states,
    v_0 ∈ V is a distinguished start state,
    Q ⊆ V is a set of accepting states,
    Σ is a finite alphabet,
    B is a finite set of Boolean variables,
    δ ⊆ V × (Σ ∪ Γ) × V is a set of transitions, or arcs.

A computation of machine M = (V, Σ, B, δ, Q, v_0) is a finite sequence C
of members of δ beginning on (v_0, σ, u) for some σ ∈ Σ ∪ Γ and u ∈ V,
ending on (u, σ, v) for some u ∈ V, σ ∈ Σ ∪ Γ and v ∈ Q, and satisfying
the constraints

    1) if (u, x↑, u') t (v, x̄?, v') is a contiguous subsequence of C,
       then t contains (w, x↓, w') for some w and w';

    2) if (u, x↓, u') t (v, x?, v') is a contiguous subsequence of C,
       then t contains (w, x↑, w') for some w and w';

    3) if t (v, x?, v') is a prefix of C, then t contains (w, x↑, w')
       for some w and w';

    4) if (u, σ, v) (w, τ, x) is a contiguous subsequence of C, then
       v = w.

The string accepted by C is the sequence of all members of Σ which
appear in C, in the order in which they appear. The language L(M)
accepted by M is

L(M) = {s ∈ Σ* : s is accepted by some computation of M}. Machine M
is deterministic if for every state u,

    2) if (u, γ, v) ∈ δ for γ ∈ Γ, then (u, γ, v) is the only member
       of δ with first state u.

A deterministic BFA is called a DBFA, while a nondeterministic BFA is
called a NBFA.

3. RESULTS

Here is a summary of the results contained in this section. Two machines
or expressions, or a machine and an expression, are equivalent if they
represent (or recognize) the same language.

1. (Relations between BRE, DBFA and NBFA) We find a remarkably close
relationship between BRE, DBFA and NBFA, much closer than that between RE,
DFA and NFA. Recall that an exponential blowup results in general in
converting a NFA to a RE (see Ehrenfeucht and Zeiger (1976)), or in
converting a NFA to a DFA (see Aho, Hopcroft and Ullman (1974)).

    a) For every length n BRE with m Boolean variables, there is an
       equivalent O(n) state BFA with m Boolean variables.

    b) For every n state, b arc NBFA with m Boolean variables,
       there is an equivalent length O(b log n) BRE with m + O(log n)
       Boolean variables.

    c) For every n state, b arc NBFA with m Boolean variables,
       there is an equivalent O(b) state DBFA with O(m + n) Boolean
       variables.

2. (Relations between BFA and FA, and between BRE and RE) We find that
Boolean variables permit concise representations of some languages. Let N
be the maximum over all BFA's with n states and m distinct Boolean variables
of the number of states in a minimal equivalent FA. Similarly, let K be the
maximum over all BRE's with length n and m distinct Boolean variables
of the length of the minimal equivalent RE.

    a) N ≤ n2^m,

    b) N ≥ c^n for some c > 1 and some m = O(n),

    c) K ≤ c^{n2^m} for some c,

    d) K ≥ 2^{2^{dn}} for some d > 0, for some m = O(n).

Theorems are numbered corresponding to the above list.

Theorem 1a. For every length n BRE E containing m Boolean
variables, there is an O(n) state BFA F which also contains m Boolean
variables, such that L(F) = L(E).

Proof. The proof is almost identical to the classical proof of the same
theorem for m = 0, which can be found in Aho, Hopcroft and Ullman (1974).


Theorem 1b. For every NBFA F with n states, b transitions and m
distinct Boolean variables there is a length O(b log n) BRE E with
m + O(log n) Boolean variables, such that L(E) = L(F).

Proof. Let the states of F be numbered 0, ..., n - 1. Let
x_1, ..., x_k, k = ⌈log n⌉, be Boolean variables not appearing in F,
which are used to represent an integer x in the range [0, ..., n - 1].
E has the form

    S · (T_1 ∪ ... ∪ T_b)* · R.

S sets x to the number of v_0. Let (u, σ, v) be the i-th transition
of F. T_i is a sequence of three actions: 1) test if x = u, 2) generate
σ, 3) set x = v. Finally R tests if x ∈ Q. We may assume without loss of
generality that b ≥ n, for otherwise some states are inaccessible. E has
length O(b log n), and has m + O(log n) Boolean variables. It is not
difficult to see that L(E) = L(F).

Corollary 1. Every regular set is representable by a BRE which
contains only one * symbol.

That contrasts with ordinary RE's, which require an unbounded number
of stars for full representational power (see Cohen (1970), or Ehrenfeucht
and Zeiger (1976)).

Proof. The expression in the proof of Theorem 1b has only one *
symbol.

Theorem 1c. For every NBFA F with n states, b arcs and m distinct
Boolean variables, there is a DBFA F' equivalent to F, with O(b) states
and O(m + n) Boolean variables.

Proof. F' performs a deterministic simulation of F. F' contains a
Boolean variable s_u for every state u in F. s_u is true at a given
point in the simulation if F could be in state u.

F' begins by setting s_{v_0} = true, and s_v = false for all v ≠ v_0.
Then F' enters a loop in which it repeatedly performs Step 1, then Step 2,
until the input is exhausted.

Step 1. Perform Boolean variable operations. Before reading the next
character, F' needs to simulate F along all paths on which no characters
are read. There is a problem here; F may contain a loop containing no read
operations, only Boolean variable operations. To avoid following such loops
forever, F' maintains a count r in variables r_1, ..., r_k, k = ⌈log(n2^m)⌉,
of the number of consecutive Boolean variable operations of F which it
performs. When r reaches n2^m, F' ignores all further Boolean variable
operations, and terminates Step 1. It should be clear to the reader that
aborting non-reading loops after n2^m steps cannot change the language of F.
After all, if F follows a length n2^m path, then some state must have been
reached twice in the path with exactly the same Boolean variable values both
times. So there is a shorter path which reaches the same state, with the
same Boolean variable values, as the longer path.

Step 1 is defined by the following program. Let δ be the transition
function of F.

    for r := 0 to n2^m - 1 do
        for every state u of F do
            if s_u then begin
                s_u↓ ;
                for every (u, γ, v) ∈ δ, where γ is x↑ or x↓ for
                        some x, do
                    begin γ ; s_v↑ end;
                for every (u, x?, v) ∈ δ do
                    if x then s_v↑ ;
                for every (u, x̄?, v) ∈ δ do
                    if not x then s_v↑ ;
                if (u, σ, v) ∈ δ for some σ ∈ Σ, then s_u↑
            end
        endfor
    endfor.

The reader should be able to convince himself that the above program correctly
simulates the Boolean variable operations of F.
Step 2. Perform a read operation. Step 2 is defined by the following
program, using auxiliary variables s'_u for every state u of F.

    for every σ ∈ Σ do
        if the next input character is σ then
            for every state u of F do
                if s_u then
                    for every (u, σ, v) ∈ δ do s'_v↑ ;
    for every state u in F do s_u ← s'_u.

We may assume without loss of generality that b ≥ n and b ≥ m. Then
machine F' described above has O(b) states and m + ⌈log n2^m⌉ + 2n + c
Boolean variables (c variables might be used to implement the for loop
on r).

The simulation used in the proof of Theorem 1c is based on a commonly
used simulation of NFA, in which the simulating machine keeps a record of
which states the NFA can be in (see Aho, Hopcroft and Ullman (1974)).
Theorem 1c presents another view of that simulation: convert the NFA to a
DBFA, and run the DBFA.

Theorem 2a. Every n state NBFA F with m Boolean variables is
equivalent to some n2^m state FA F'.

Informal Proof. F' can be constructed as follows. First, make 2^m
copies of F, one for each subset of the Boolean variables in F.
Eliminate x↑ arcs by replacing arc (u, x↑, v) in copy s ⊆ B by a Λ-arc
from u in copy s to v in copy s ∪ {x}. Similarly, replace (u, x↓, v)
by a Λ-arc from u in copy s to v in copy s - {x}. Replace (u, x?, v)
in copy s by a Λ-arc, still within copy s, iff x ∈ s. If x ∉ s,
delete (u, x?, v) from copy s. x̄? arcs are eliminated similarly. When
all arcs not labeled by members of Σ have been eliminated, eliminate
Λ-arcs by the standard method (no states need be added). It is left to the
reader to verify that machine F' so constructed recognizes the same
language as F.


Theorem 2b. For all sufficiently large n there is an n state DBFA
F which is not equivalent to any NFA F' of fewer than c^n states for
some c > 1.

Proof. Let F_k be the machine which implements the following program
using k Boolean variables to represent an integer in the range [0, 2^k - 1].

    for i := 0 to 2^k - 1 do (read a).

F_k recognizes a^{2^k}, and has length dk for some d. Clearly F_k is not
equivalent to any NFA of fewer than 2^k states. Let F = F_{⌊n/d⌋}, padded to
exactly n states. Then F is equivalent to no NFA with fewer than c^n
states for c = 2^{1/(d + ε)}.

Theorem 2c. Every length n BRE E with m Boolean variables is
equivalent to some length c^{n2^m} RE E' for some constant c.

Proof.

    1) find an n state, m variable BFA F equivalent to E
       (possible by Theorem 1a);

    2) find an n2^m state FA F' equivalent to F (possible by
       Theorem 2a);

    3) find a c^{n2^m} length RE E' equivalent to F' (possible
       by the standard method of converting FA into RE. See
       Ehrenfeucht and Zeiger (1976)).

Before proving Theorem 2d, we introduce some notation. The notions
of normality, covering and index are from Ehrenfeucht and Zeiger (1976).

Definition. A labeled graph is a directed graph, with each arc
labeled by a member of some given label set. There is at most one arc
between two given nodes. A trail in a labeled graph is the sequence of
labels along some connected path.

Definition. A RE E is normal w.r.t. graph A iff there are
functions init and fin from subexpressions of E to nodes in A such
that

    1) If F ∪ G is a subexpression of E, then init(F ∪ G)
       = init(F) = init(G) and fin(F ∪ G) = fin(F) = fin(G).

    2) If F·G is a subexpression of E then init(F·G) = init(F)
       and fin(F·G) = fin(G).

    3) If F* is a subexpression of E then init(F*) = init(F)
       = fin(F) = fin(F*). init(F*) is called the base point
       of F*.

    4) If F is a subexpression of E then L(F) is contained
       in the set of trails from init(F) to fin(F).

Definition. A RE E covers string s if s is a contiguous
substring of some member of L(E).

Definition. The index I_s(E) of string s in RE E is the largest
k ≥ 0 such that E covers s^k, if such a k exists, and is ∞ when no
such k exists. E is s-finite if I_s(E) ≠ ∞.


Definition. A labeled graph A is conservative if no loop path in
A has the same trail as any other path in A. Hence, in a conservative
graph, a loop path is uniquely determined by its trail.

Lemma 1. Let A be a conservative graph, s be a loop trail in A,
and E be normal w.r.t. A. If E is s-finite, then I_s(E) ≤ ℓ(E) (the
length of E).

Proof. The proof is by induction on the length of E. The cases σ,
F ∪ G and F·G are trivial, given

    1) I_s(σ) = 0 or 1 for all s,

    2) I_s(F ∪ G) = max(I_s(F), I_s(G)),

    3) I_s(F·G) ≤ I_s(F) + I_s(G) + 1,

all of which are obviously true. Suppose E is F*. If the base point of
F* is not on the loop determined by s, then F* cannot go around s any
more times than F does. Suppose the base point of F* is on the loop
determined by s. If L(F) contains a cyclic permutation of s, then F*
is s-infinite, contrary to assumption. If on the other hand L(F) does
not contain a cyclic permutation of s, then I_s(F*) ≤ I_s(F) + 1, which is
enough to finish the induction.

Definition. A labeled graph A is forward deterministic (or just
deterministic) if for every node u and label σ, there is at most one arc
labeled σ leaving u. A is backward deterministic if the reversal of A,
obtained by reversing all of the arcs in A, is deterministic.

Lemma 2. Let A be forward and backward deterministic, and RE E be
such that L(E) is contained in the set of trails from u to v in A,
for some nodes u and v. Then E is normal w.r.t. A.
Proof. Let E have length n, and subexpression F of E have
length k. We define init(F) and fin(F) inductively on n - k. For
k = n (F = E), init(F) = u and fin(F) = v.

G ∪ H. Suppose init(G ∪ H) and fin(G ∪ H) are already defined.
Define init(G) = init(H) = init(G ∪ H) and fin(G) = fin(H) = fin(G ∪ H).
Trivially, L(G) ⊆ L(G ∪ H) ⊆ trails from init(G) to fin(G), by
induction. Similarly for H.

G*. Suppose init(G*) = fin(G*) has been defined. Define init(G)
= init(G*) = fin(G*) = fin(G). Trivially, L(G) ⊆ L(G*) ⊆ trails from
init(G) to fin(G), by induction.

G·H. Suppose init(G·H) and fin(G·H) have been defined so that
L(G·H) ⊆ trails from init(G·H) to fin(G·H). Define init(G) = init(G·H)
and fin(H) = fin(G·H). Suppose L(G) contains two strings g_1 and g_2,
and L(H) contains h. A must contain the subgraph depicted below, since
g_1·h and g_2·h are both in L(G·H).

[Figure: paths g_1 and g_2 from init(G) to nodes u and v, each followed
by a path h to fin(H).]

By forward determinism, u and v are unique. By backward determinism,
u = v. Similarly, suppose L(H) contains h_1 and h_2, and L(G) contains
g. Then A must contain the subgraph

[Figure: a path g from init(G) to nodes u and v, followed by paths h_1
and h_2 to fin(H).]

By backward determinism, u and v are unique. By forward determinism,
u = v. Hence, we see that all of the paths in L(G) starting at init(G)
must end on u = v, which is just where all of the paths in L(H) start.
Define fin(G) = u = v = init(H). Then L(G) ⊆ trails from init(G) to
fin(G), and L(H) ⊆ trails from init(H) to fin(H).

Ehrenfeucht and Zeiger show that no short regular expression can
represent all of the trails between two given nodes in the complete
graph, where each arc bears a distinct label. We wish to modify the
complete graph in such a way that it is still hard to describe for ordinary
regular expressions, but is easy for Boolean regular expressions. We do
that by effectively replacing the arc from u to v by a path whose
trail is # v̄ $ ū^R, where ū and v̄ are the binary representations of u
and v, respectively. The nodes which correspond to nodes in the
complete graph are called primary nodes; those which are on paths between
primary nodes are called secondary nodes.

Definition. The order n fan graph FAN_n is defined for n a power of
2 as follows. There are n primary nodes, with 4n - 3 secondary nodes
associated with each primary node. A typical primary node and its harem
of secondary nodes is pictured below. All arcs point down.
The primary nodes are numbered from 0 to n - 1. The $ arcs are
connected in such a way that there is a trail # v̄ $ ū^R from primary node u
to primary node v, where u, v ∈ {0, ..., n - 1}, and ū (v̄) is the
log n digit binary representation of u (v).
[Figure: a primary node and its secondary nodes. Labels: n top level
nodes; $ arcs; primary node.]

Definition. An m-restriction of FAN_n, for m ≤ n, is a subgraph
of FAN_n containing m primary nodes, such that the length 2 log n + 2
paths between existing primary nodes are intact, and all extraneous nodes
are absent.

If A is an m-restriction of FAN_n, then A is clearly forward and
backward deterministic. A is also conservative, for every loop passes
through a primary node, and immediately after leaving a primary node, a trail
names the node just left. The # symbol indicates the start of a node name.

Theorem 3. For every n = 2^k, every m-restriction A of FAN_n,
m ≤ n, and every primary node u in A, there is a trail p in A from
u to itself such that if E is any regular expression covering p, and
which is normal w.r.t. FAN_n, then ℓ(E) ≥ 2^m.

Proof, by induction on m. In what follows, "normal" means normal
w.r.t. FAN_n.

m = 1. A 1-restriction of FAN_n is just a loop containing 2 log n + 2
nodes. That loop has length at least 2^m = 2.

m > 1. Let u_0, ..., u_{m-1} be the primary nodes in A. Let A_i,
i = 0, ..., m - 1, be the (m - 1)-restriction of FAN_n obtained by deleting
u_i and associated secondary nodes from A. Let ⊖ and ⊕ be subtraction
and addition mod m. By induction, every A_i contains a trail p_i from
u_{i⊕1} to itself such that every normal E covering p_i has length at least
2^{m-1}. For j = 0, ..., m - 1, define

    q_j = (p_{j⊖1}^k a_{j,j⊕1}) (p_j^k a_{j⊕1,j⊕2}) ··· (p_{j⊖2}^k a_{j⊖1,j}),

where k = 2^m and a_{i,j} is a length 2 log n + 2 trail from node i to
node j. q_j is a loop from node j to itself in A.

To prove Theorem 3 we need to show that, for j = 0, ..., m - 1, if E
is normal and covers q_j, then ℓ(E) ≥ 2^m. Suppose E is normal and covers
q_j. Then for every i, I_{p_i}(E) ≥ 2^m. If E is p_i-finite for some i,
then by Lemma 1 ℓ(E) ≥ 2^m, which is what we want to show.

Suppose on the other hand that E is p_i-infinite for all i. In
what follows, "minimal" means "minimal with respect to the relation 'is a
subexpression of'". Since E has finitely many subexpressions, for each i
there is a minimal subexpression of E which is p_i-infinite. Such a
subexpression must be a star, say F_i*. Choose a minimal F_i* from among the
F_0*, ..., F_{m-1}*. F_i* is normal and covers p_i in A_i, so by induction
ℓ(F_i*) ≥ 2^{m-1}.

The subexpression F_i* of E has a base point v in FAN_n by the
normality of E. The path for which p_i is a trail must pass through the
base point of F_i*, for otherwise either F_i* would not be p_i-infinite,
or F_i would be p_i-infinite, violating the minimality of F_i*. Hence,
v is on the path determined by p_i, that is, v is in A_i.

Let E/F_i* be obtained from E by replacing F_i* by the expression
Λ representing the null string. By definition, ℓ(Λ) = 0. By supposition
E is p_v-infinite. But the path determined by p_v does not pass through
node v, while every member of L(F_i*) determines a path which does pass
through v. Hence, E/F_i* must still be p_v-infinite. Thus E/F_i*
covers p_v in A_v, and by induction ℓ(E/F_i*) ≥ 2^{m-1}. Putting that fact
together with ℓ(F_i*) ≥ 2^{m-1} gives the desired result, ℓ(E) ≥ 2^m.

Theorem 2d. For infinitely many m there is a BRE E of length m
which is not equivalent to any RE E' of length less than 2^{2^{dm}} for some
d > 0.

Proof. Let E be a BRE for the set of all trails from primary node 0
in FAN_n to itself. E can use variables x_1, ..., x_k and y_1, ..., y_k,
k = log n, to represent integers x and y respectively, in the range
[0, ..., n - 1]. Let "x ← random" be the expression
(x_1↑ ∪ x_1↓)·(···)·(x_k↑ ∪ x_k↓). Let "generate x" be the expression
(x̄_1?·0 ∪ x_1?·1)·(···)·(x̄_k?·0 ∪ x_k?·1), and "generate y^R" be
(ȳ_k?·0 ∪ y_k?·1)·(···)·(ȳ_1?·0 ∪ y_1?·1).

E is defined by

    E = (y ← 0) · ((x ← random) · # · (generate x) · $ · (generate y^R)
                   · (y ← x))* · (y = 0?).

E has length c log n for some c. Suppose E' is equivalent to E.
By Lemma 2 E' is normal w.r.t. FAN_n. By Theorem 3, ℓ(E') ≥ 2^n. Let
m = c log n. Then ℓ(E') ≥ 2^{2^{dm}} for d = 1/c.

This concludes the proof of Theorem 2d.

The result of Ehrenfeucht and Zeiger, that the set of paths in some n
node graphs can only be described by length 2^{n-1} regular expressions,
requires alphabets of unbounded size. As the proof given here uses a 4-symbol
alphabet, we can give a bound which applies when the alphabet size is bounded.
However, because FAN_n has O(n^2) nodes, the result for bounded alphabets
is weaker than that for unbounded alphabets.

Theorem 4. For infinitely many m there is an m state FA F over
a 4-element alphabet which is equivalent to no RE E of length less than
d^{√m} for some d > 1.

Proof. Let F be the automaton induced by FAN_n, where node 0 is
both the start and final state. F has cn^2 states for some c. Any RE
E for the set of trails from node 0 to itself must have length at least
2^n. Let m = cn^2. Then ℓ(E) ≥ d^{√m} for d = 2^{1/√c}.


4. OPEN QUESTIONS


The lower bounds 2b and 2d require length n expressions with O(n)
Boolean variables. Is it possible to make the number of Boolean variables
a parameter, which can vary independently of n, as in Theorems 2a and 2c?
Our upper bound on Boolean variables removal from regular expressions is very
poor when m = 0, for our method causes an exponential blowup when no
work at all is required. Suppose m is not 0, but is small, for
instance m = 1. Then is the exponential blowup really required? If so,
then a single Boolean variable is extremely powerful. If not, then where
does the double exponential behavior of Theorem 2d take hold?

The d^{√m} lower bound of Theorem 4 is weaker than the 2^{m-1} bound
proved by Ehrenfeucht and Zeiger for unbounded alphabets. Can a d^m
bound be proved for bounded alphabets?